Hindi-English Language Identification, Named Entity Recognition and Back Transliteration: Shared Task System Description

Authors

  • Navneet Sinha
  • Gowri Srinivasa
Abstract

This paper presents an algorithm for word-level language identification, named entity recognition and classification, and transliteration of Indian language words written in the Roman script back to their native Devanagari script in bilingual text. We propose the construction of an extensive, hierarchically structured dictionary and a hierarchical rule-based classifier to expedite word search and language identification. The proposed method uses lexical, contextual and special-character features particular to Hindi and English. With a few modifications, the system can be adapted to other languages. The submitted system shows the best performance in English token-level precision (0.895) and the second best in Indian language token recall (0.915). The transliteration f-measure is relatively low (0.15); it could be improved significantly with more representative and exhaustive training data.
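The idea of a hierarchical dictionary combined with rule-based tagging can be illustrated with a minimal sketch. This is not the authors' implementation: the trie layout, the tag names ("en", "hi", "X"), the ordering of the rules, and the default for unknown words are all assumptions made here for illustration only.

```python
from typing import Dict, List, Optional, Set


class TrieNode:
    """One level of the (assumed) hierarchical dictionary: one node per character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.langs: Set[str] = set()  # languages in which the word ending here is listed


class HierarchicalLexicon:
    """Hypothetical character trie that maps Romanized words to language tags."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str, lang: str) -> None:
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.langs.add(lang)

    def lookup(self, word: str) -> Set[str]:
        node = self.root
        for ch in word.lower():
            nxt = node.children.get(ch)
            if nxt is None:
                return set()  # word is not in the dictionary
            node = nxt
        return node.langs


def tag_token(token: str, lexicon: HierarchicalLexicon, prev_tag: Optional[str]) -> str:
    # Rule 1 (special characters): tokens with no letters get the assumed tag "X".
    if not any(c.isalpha() for c in token):
        return "X"
    # Rule 2 (lexical): an unambiguous dictionary hit decides the language.
    langs = lexicon.lookup(token)
    if len(langs) == 1:
        return next(iter(langs))
    # Rule 3 (contextual): ambiguous or unknown words copy the previous language tag.
    if prev_tag in ("en", "hi"):
        return prev_tag
    return "en"  # arbitrary default, chosen only for this sketch


def tag_sentence(tokens: List[str], lexicon: HierarchicalLexicon) -> List[str]:
    tags: List[str] = []
    prev: Optional[str] = None
    for tok in tokens:
        tag = tag_token(tok, lexicon, prev)
        tags.append(tag)
        if tag in ("en", "hi"):
            prev = tag  # only language tags are carried forward as context
    return tags


# Toy data, invented for the example.
demo = HierarchicalLexicon()
for w, lang in [("my", "en"), ("to", "en"), ("ghar", "hi"), ("to", "hi")]:
    demo.add(w, lang)
print(tag_sentence(["my", "ghar", "to", "!!"], demo))  # ['en', 'hi', 'hi', 'X']
```

In this toy run, "to" is listed under both languages, so the contextual rule resolves it to the tag of the preceding word. A real system would need far larger word lists and additional rules for named entities.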

Similar resources

LIGA and Syllabification Approach for Language Identification and Back Transliteration: Shared Task Report by DAIICT

This paper aims to address the solution for Subtask 1 of the Shared Task on transliterated search, a task in FIRE '14. The task addresses the problem of data containing English words and transliterated words of Indian languages in English. The task calls for language identification and subsequent back transliteration into the native Indian scripts. The system proposed herewith implements Language ...

Named Entity Recognition in Hindi using Maximum Entropy and Transliteration

(NER) system becomes challenging if proper resources are not available. Gazetteer lists are often used for the development of NER systems. In many resource-poor languages gazetteer lists of proper size are not available, but sometimes relevant lists are available in English. Proper transliteration makes the English lists useful in the NER tasks for such languages. In this paper, we have describ...

NEWS 2009 Machine Transliteration Shared Task System Description: Transliteration with Letter-to-Phoneme Technology

We interpret the problem of transliterating English named entities into Hindi or Japanese Katakana as a variant of the letter-to-phoneme (L2P) subtask of text-to-speech processing. Therefore, we apply a re-implementation of a state-of-the-art, discriminative L2P system (Jiampojamarn et al., 2008) to the problem, without further modification. In doing so, we hope to provide a baseline for the NEW...

A Hybrid Approach of English-Hindi Named-entity Transliteration

In recent years, machine transliteration has become a focus of research. Both machine translation and transliteration are important for e-governance and web-based online multilingual applications. Machine translation translates the source language into the target language, which results in wrong translations for named entities. Named entities are required to be translated with preserving t...

CRF-based Named Entity Recognition @ICON 2013

This paper describes performance of CRF based systems for Named Entity Recognition (NER) in Indian language as a part of ICON 2013 shared task. In this task we have considered a set of language independent features for all the languages. Only for English a language specific feature, i.e. capitalization, has been added. Next the use of gazetteer is explored for Bengali, Hindi and English. The ga...

Publication date: 2014